{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Late Chunking\n",
    "Late chunking is a technique to leverage the model’s long-context capabilities for generating contextual chunk embeddings. Include `late_chunking=True` in your request to enable contextual chunked representation. When set to true, Jina AI API will concatenate all sentences in the input field and feed them as a single string to the model. Internally, the model embeds this long concatenated string and then performs late chunking, returning a list of embeddings that matches the size of the input list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set up chromadb and jina API key\n",
    "! pip install chromadb --quiet\n",
    "! pip install httpx --quiet\n",
    "! pip install pandas --quiet\n",
    "import os\n",
    "import chromadb\n",
    "import getpass\n",
    "import pandas as pd\n",
    "\n",
    "from chromadb.utils.embedding_functions import JinaEmbeddingFunction\n",
    "from chromadb.api.types import QueryResult\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_to_df(qr: QueryResult):\n",
    "    df = pd.DataFrame(qr[\"ids\"], columns=[\"id\"])\n",
    "    df[\"document\"] = qr[\"documents\"]\n",
    "    df[\"distance\"] = qr[\"distances\"]\n",
    "    return df\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"CHROMA_JINA_API_KEY\"] = getpass.getpass(\"Jina API Key:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup\n",
    "\n",
    "Let's set up two collections to compare the difference in retrieval with late chunking on and off using `jina-embeddings-v3`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "client = chromadb.EphemeralClient()\n",
    "\n",
    "# create collection with Jina embedding function with late chunking enabled\n",
    "late_chunking_collection = client.create_collection(name=\"late_chunking\", configuration={\n",
    "    \"embedding_function\": JinaEmbeddingFunction(\n",
    "        model_name=\"jina-embeddings-v3\",\n",
    "        # enable late chunking\n",
    "        late_chunking=True,\n",
    "        task=\"text-matching\",\n",
    "    )\n",
    "})\n",
    "\n",
    "# create collection with Jina embedding function with late chunking disabled\n",
    "normal_collection = client.create_collection(name=\"normal\", configuration={\n",
    "    \"embedding_function\": JinaEmbeddingFunction(\n",
    "        model_name=\"jina-embeddings-v3\",\n",
    "        task=\"text-matching\",\n",
    "    )\n",
    "})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Documents & When to use Late Chunking\n",
    "\n",
    "Late chunking works best with Chroma when a group of documents share similar context. In this case, all documents are about Berlin, with documents referring to \"It\" and \"The city\". Normally, the model will not have the context to understand these are referring to Berlin, but with late chunking that context is now imbued with the other documents' embeddings.\n",
    "\n",
    "For best retrieval results, try to separate differing topics with separate adds when using late chunking. For example, if the first set of documents talks about Berlin, and the next refers to computer operating systems, it would be best to not contaminate one set of documents' context with the other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set up documents\n",
    "documents = [\n",
    "    'Berlin is the capital and largest city of Germany.', \n",
    "    'The city has a rich history dating back centuries.', \n",
    "    'It was founded in the 13th century and has been a significant cultural and political center throughout European history.', \n",
    "    'The metropolis experienced dramatic changes during the 20th century, including two world wars and a period of division.', \n",
    "    'After reunification, it underwent extensive reconstruction and modernization efforts.', \n",
    "    'Its population reached 3.85 million inhabitants in 2023, making it the most populous urban area in the country.', \n",
    "    'This represents a significant increase from previous decades, driven largely by immigration and economic opportunities.', \n",
    "    'The city is known for its vibrant cultural scene and historical significance.', \n",
    "    'Many tourists visit its famous landmarks each year, contributing significantly to the local economy.', \n",
    "    'The Brandenburg Gate stands as its most iconic symbol.'\n",
    "]\n",
    "\n",
    "ids = [str(i+1) for i in range(len(documents))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# add documents to the normal collection\n",
    "normal_collection.add(\n",
    "    ids=ids,\n",
    "    documents=documents,\n",
    ")\n",
    "\n",
    "# add documents to the late chunking collection\n",
    "late_chunking_collection.add(\n",
    "    ids=ids,\n",
    "    documents=documents,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Normal Collection Results:\n",
      "  id                                           document              distance\n",
      "0  1  [Berlin is the capital and largest city of Ger...  [0.7308309078216553]\n",
      "1  1  [Berlin is the capital and largest city of Ger...  [0.8990052342414856]\n",
      "\n",
      "--------------------------------\n",
      "\n",
      "Late Chunking Collection Results:\n",
      "  id                                           document              distance\n",
      "0  6  [Its population reached 3.85 million inhabitan...  [0.5808092951774597]\n",
      "1  3  [It was founded in the 13th century and has be...  [0.7019132375717163]\n"
     ]
    }
   ],
   "source": [
    "# let's query the normal collection and see the results\n",
    "results = normal_collection.query(\n",
    "    query_texts=[\"What is Berlin's population?\", \"When was Berlin founded?\"],\n",
    "    n_results=1,\n",
    ")\n",
    "\n",
    "print(\"Normal Collection Results:\")\n",
    "print(convert_to_df(results))\n",
    "\n",
    "\n",
    "print(\"\\n--------------------------------\\n\")\n",
    "\n",
    "# let's query the late chunking collection and see the results\n",
    "results = late_chunking_collection.query(\n",
    "    query_texts=[\"What is Berlin's population?\", \"When was Berlin founded?\"],\n",
    "    n_results=1,\n",
    ")\n",
    "\n",
    "print(\"Late Chunking Collection Results:\")\n",
    "print(convert_to_df(results))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
